The problem of missing values in data is a common issue that can arise during the data collection and preprocessing stages. Missing values can occur for a variety of reasons, such as errors in data entry or measurement, or the intentional withholding of information. These missing values can have a significant impact on the quality and accuracy of any analysis performed on the data.
One of the most important steps in dealing with missing data is to handle it appropriately before any analysis is performed. This is where imputation comes in. Imputation is the process of replacing missing values with estimated ones. The goal of imputation is to minimize the loss of information and bias in the data while maximizing the usefulness of the remaining information.
It is important to address missing values before any analysis because they can lead to biased or incorrect results. For example, if the missing data is not handled properly, it can lead to a biased estimate of the mean or standard deviation. Additionally, many statistical models and machine learning algorithms cannot handle missing data and will either fail or produce inaccurate results.
Overall, imputing missing values is crucial to ensure the validity and accuracy of any analysis performed on the data.
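As a first illustration, here is a minimal sketch of simple mean imputation with pandas (the toy column below is a hypothetical example, not part of the dataset used later in this notebook):

```python
import numpy as np
import pandas as pd

# Toy column with two missing values
df = pd.DataFrame({"duration_min": [90, np.nan, 120, 110, np.nan]})

# Replace each missing value with the column mean (simple imputation)
df["duration_min"] = df["duration_min"].fillna(df["duration_min"].mean())
print(df["duration_min"].tolist())
```

More sophisticated strategies (median, mode, or model-based imputation) follow the same pattern: estimate a replacement value, then substitute it for the nulls.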
The goal of this notebook is to perform an in-depth analysis of the imputation of missing values using both Spark and Pandas, two widely used data processing libraries. The analysis will include a comparison of the different methods available for imputation in each library, and the robustness of each method will be evaluated using appropriate statistical measures. The notebook will also demonstrate the implementation of these methods in a real-world scenario.
The specific objectives of this notebook are:
To introduce the problem of missing values in data and explain why it is important to address it before any analysis is performed.
To present an overview of the different methods of imputation available in Spark and Pandas.
To evaluate the robustness of each imputation method using appropriate statistical measures.
To demonstrate the implementation of the imputation methods in a real-world scenario.
To compare the results obtained from the methods in Spark and Pandas and draw conclusions about which library is more suitable for imputing missing values in a given context.
Overall, this notebook aims to provide a comprehensive understanding of the imputation of missing values in data using Spark and Pandas, and help data scientists and analysts make informed decisions when dealing with missing data.
Before handling missing data, the first step is to install any libraries that are not already present in the Databricks environment. As each library is added, a short description and its key aspects will be included in the notebook to ensure proper use and understanding, and to guarantee that all the functions and tools required in the subsequent steps are available.
PySpark is the Python API for Apache Spark, a fast and general engine for large-scale data processing. It is designed to provide a simple and easy-to-use programming interface for parallel computing on clusters of machines. PySpark is built on top of Spark's Java API and exposes the Spark programming model to Python.
#!pip install pyspark
The wordcloud library will be used in this notebook to create visual representations of the data, specifically to compare the missing values using pandas and Spark. This library allows for the creation of word clouds from text data, where the size of each word represents its frequency in the text. This can be useful for identifying patterns and trends in the missing values, and can also be used for exploratory data analysis.
#!pip install wordcloud
import matplotlib.pyplot as plt
from itables import init_notebook_mode
import itables.options as opt
init_notebook_mode(all_interactive=True)
%config InlineBackend.figure_format = 'retina'
opt.column_filters= 'footer'
opt.classes = 'display nowrap cell-border'
opt.dom = 'lftipr'
opt.search = {"regex": True, "caseInsensitive": True, "smart":True, 'highlight':True}
opt.paging = True
opt.autoWidth=False
opt.showIndex = False
opt.columnDefs=[{"width": "120px", "targets": "_all"}]
Now, we will be importing the data using both Spark and Pandas. It is important to note that both of these libraries have their own unique advantages and disadvantages when it comes to handling large datasets. Spark is highly efficient and can handle large amounts of data with ease, but it can be more difficult to work with than Pandas. On the other hand, Pandas is relatively easy to use and is great for data manipulation and cleaning, but it may struggle with very large datasets. Therefore, it is important to consider the size and complexity of the dataset when deciding which library to use for data import. Additionally, it is also important to keep in mind that the specific use case and desired outcome of the analysis may also play a role in determining which library is best suited for the task at hand.
PySpark is the Python library for Spark programming that allows for easy and efficient processing of large amounts of data using the power of the Apache Spark engine. One of the main advantages of using PySpark in Databricks is its ability to scale up and distribute computations across multiple machines, which can greatly improve the performance of big data processing tasks. One important thing to consider is that in the Community Edition of Databricks the CSV file cannot be read with a simple one-liner such as pd.read_csv, so the Spark code to import the data is longer than the equivalent Pandas code.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession
#Create Spark Session
spark = SparkSession.builder.getOrCreate()
# File location and type
file_location = "netflix_titles.csv"
# Define the schema
schema = StructType([
    StructField("show_id", IntegerType()),
    StructField("type", StringType()),
    StructField("title", StringType()),
    StructField("director", StringType()),
    StructField("cast", StringType()),
    StructField("country", StringType()),
    StructField("date_added", DateType()),
    StructField("release_year", IntegerType()),
    StructField("rating", StringType()),
    StructField("duration", StringType()),
    StructField("listed_in", StringType())
])
# The applied options are for CSV files. For other file types, these will be ignored.
spark_data = spark.read.format('csv') \
    .option("inferSchema", 'false') \
    .option("header", 'true') \
    .option("dateFormat", "MMMM d, yyyy") \
    .schema(schema) \
    .load(file_location)
.option() is a method provided by the Spark DataFrame API that allows you to specify options when reading in data. These options can include things like the file format (e.g. CSV), whether to infer the schema automatically, and how to handle column names. The .option() method takes two arguments: the first is the name of the option, and the second is the value for that option. The options specified with .option() will be used when the .load() method is called to read in the data. Some typical options are:
inferSchema: By default, Spark will try to infer the schema of the data automatically when reading in a CSV file. This can be slow and can also result in unexpected data types. Setting this option to 'false' will prevent Spark from trying to infer the schema and instead use the schema that is specified in the next option, 'schema'.
header: By default, Spark assumes that the first row of the CSV contains the column names. Setting this option to 'true' will tell Spark to use the first row as the column names. If this is set to 'false', Spark will not use the first row as the column names and will instead generate default column names like "col1", "col2", etc.
dateFormat: By default, Spark will read dates in the format "yyyy-MM-dd" but this option allows you to specify any other format of your date field.
schema: This option allows you to specify the schema of the data explicitly. This can be useful if you know the schema ahead of time and want to avoid the overhead of inferring it automatically.
load: strictly speaking this is a method rather than an option; it takes the path of the CSV file you want to import and triggers the actual read.
Using these options when importing a CSV file in Spark can help ensure that your data is read in correctly and with the desired schema, and can improve performance by avoiding the overhead of inferring the schema automatically.
opt.maxBytes = spark_data.toPandas().memory_usage().sum()
display(spark_data.toPandas())
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in |
|---|---|---|---|---|---|---|---|---|---|---|
By using the options in the import statement, we ensure that the dataframe is in the correct format according to the schema specified. For example, by setting the "inferSchema" option to 'false' we are telling Spark to not try to infer the schema automatically and instead use the schema specified in the "schema" option. This ensures that the data is correctly mapped to the desired column types.
The "dateFormat" option is important because it allows you to specify the format of the date field in the CSV file. Without it, Spark assumes the default format "yyyy-MM-dd", and any date that does not match is converted to null. In this case the 'date_added' column contains values like 'September 9, 2019', so the correct format string is 'MMMM d, yyyy'. By specifying the correct date format, Spark parses the date field correctly and maps it to the right column in the dataframe.
Overall, these options allow us to have more control over the format of the dataframe and make sure that it is in the correct format according to the schema.
Using pandas for data import is a simpler process compared to PySpark. The read_csv() function allows for easy import of csv files, even when they are stored in a public repository and accessed via a URL. One major advantage of using pandas is that it does not require the creation of a schema for the variables. This can save time and effort when working with large datasets. However, it's important to keep in mind that Pandas is not as efficient as PySpark when working with big data. So, if you are dealing with large datasets, PySpark is the best option.
import pandas as pd
url = 'netflix_titles.csv'
pandas_data = pd.read_csv(url, parse_dates=['date_added', 'release_year'])
#pandas_data['date_added'] = pd.to_datetime(pandas_data.date_added, format='%B %d, %Y').dt.date
pandas_data['release_year'] = pd.to_datetime(pandas_data.release_year).dt.year
It is important to have the correct date format in pandas because it allows for accurate manipulation and analysis of the data; inaccurate date formats can lead to errors or misinterpretation. By using the pd.to_datetime() function on the 'date_added' column, we ensure that it is correctly formatted as a date, which enables analysis based on specific dates. Similarly, converting the 'release_year' column with pd.to_datetime() and then extracting only the year component ensures that it is stored as an integer, which enables accurate analysis by year.
opt.maxBytes = pandas_data.memory_usage().sum()
pandas_data
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
pandas_data.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
'release_year', 'rating', 'duration', 'listed_in', 'description'],
dtype='object')
When reading in a data frame, it is important to make sure that the data is in the correct format for analysis. This is particularly important for columns that represent dates and times, such as the 'date_added' and 'release_year' columns.
In the case of the 'date_added' column, it is formatted as '2019-09-09'. This is a standard format for dates, with the year, month, and day separated by dashes. This format is easily readable and can be easily used for temporal analysis.
The 'release_year' column is also in the correct format, with the year formatted as an integer. This format is useful when performing statistical analysis by year. It is important that the column is in this format so that it can be used for groupby and aggregation operations.
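To illustrate the point about groupby operations, here is a small hedged example (toy data, not the Netflix dataset) counting titles per integer year:

```python
import pandas as pd

# Toy frame: with release_year stored as an integer, grouping by year is direct
df = pd.DataFrame({"release_year": [2018, 2019, 2019, 2020],
                   "title": ["a", "b", "c", "d"]})
titles_per_year = df.groupby("release_year")["title"].count()
print(titles_per_year)
```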
The printSchema() method is a useful tool when working with DataFrames in Spark as it allows you to quickly view the data types and names of columns in a DataFrame. This method displays the schema of the DataFrame in a tree format, with each level of the tree representing a level of nested fields in the schema. By using this method, you can easily identify any issues or errors in the schema, such as incorrect data types. Additionally, it can help you understand the structure of your data and make any necessary adjustments before performing further analysis or operations on the DataFrame.
spark_data.printSchema()
root
 |-- show_id: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
According to the schema provided, there are certain adjustments that need to be made to improve the data. In this case, the 'show_id' column is shown as an integer, however it must be converted to object to avoid including it in statistics.
When a column is defined as an integer, it is considered as a numeric column and it can be used in mathematical operations such as sum, average, standard deviation, etc. However, the 'show_id' column is not a numeric value, it is a unique identifier for each show, so it should not be included in mathematical operations and it should not be used as a measure in any statistical analysis.
Therefore, by converting the 'show_id' column from integer to object, we ensure that this column is treated as a string or categorical data, which is the correct data type for this column. This will prevent any errors or inaccuracies in the analysis of the data and will make the data more meaningful.
In summary, converting the 'show_id' column from integer to object is important to maintain the integrity of the data, and to prevent any errors or inaccuracies in the analysis of the data.
The info() method in the Python library pandas is used to get a summary of the dataframe, including the name of the columns, their data types, and the number of non-null values.
When you use the info() method on a dataframe, it will display the following information:
pandas_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       6234 non-null   int64
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int32
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int32(1), int64(1), object(10)
memory usage: 560.2+ KB
As you can see, the 'show_id' column is currently read as an integer. However, it is important to note that this column serves as a unique identifier for each show and should not be treated as a numerical value for statistical analysis.
pandas_data['show_id'] = pandas_data.show_id.astype('object')
Converting the unique identifier, such as the 'show_id' column, to an object type ensures that it will not be included in any statistical calculations, preserving the integrity of the data and avoiding any inaccuracies in the results. This approach is similar to how it is handled in Spark, where the 'show_id' is treated as an object, ensuring that it is not included in any statistics.
The select method is used in Spark to select specific columns or to perform operations on the columns and create new columns. In this case, it is used to perform a count of the null values in each column of the DataFrame spark_data by using the count function and the when expression to count only the null values. The alias method is used to give the new column the same name as the original column.
Converting the resulting DataFrame to a Pandas DataFrame using toPandas() method has the advantage of being able to use the more advanced and convenient manipulation tools provided by Pandas, such as the sort_values and round methods. This allows you to easily sort the DataFrame by the total number of null values in descending order and round the percentage values to 3 decimal places. Also, with pandas you can use many libraries that are not available on pyspark like matplotlib, seaborn etc.
It is worth noting that converting the DataFrame to Pandas is not always the best option when working with large datasets, as it can cause performance issues due to the need to transfer the data from the Spark cluster to the local machine. However, if the dataset is small enough to fit in memory, converting to Pandas can be a useful tool for data manipulation and analysis.
spark_data_nulls = spark_data.select([count(when(col(c).isNull(), c)).alias(c) for c in spark_data.columns])
spark_nulls = spark_data_nulls.toPandas()
missing_spark_df = pd.DataFrame()
missing_spark_df['category'] = spark_nulls.columns
missing_spark_df['total'] = spark_nulls.iloc[0].values
missing_spark_df['percentage'] = missing_spark_df['total']/ spark_data.count()
missing_spark_df = missing_spark_df.sort_values('total', ascending=False).round(4)
Converting the total counts and percentages of null values in a DataFrame to a separate DataFrame is convenient for several reasons. One of the main benefits is the ability to easily visualize the data. By having the null values data in a separate DataFrame, it makes it simpler to create graphs and charts that show the distribution of null values across the columns. This can be useful for identifying patterns in the data and identifying which columns have a high number of null values, which can be useful in the data cleaning process.
Another advantage is that it facilitates data analysis. By having the null values data in a separate DataFrame, it makes it easier to analyze the data and make decisions about how to handle missing values. For example, you can use the percentage of null values to decide whether to drop a column, impute the missing values, or use some other strategy.
Additionally, having the null values data in a separate DataFrame also makes it easier to compare different datasets. You can compare the null values of two datasets to see how they differ and identify if one dataset is missing more data than the other.
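As a sketch of this kind of decision, the percentage of nulls can be used to drop columns above a threshold (the toy data and the 0.5 cutoff below are illustrative assumptions, not rules applied to the Netflix dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: 'director' is 75% null, 'title' is complete
df = pd.DataFrame({"director": [np.nan, np.nan, np.nan, "A"],
                   "title": ["t1", "t2", "t3", "t4"]})

pct_missing = df.isna().mean()                  # fraction of nulls per column
to_drop = pct_missing[pct_missing > 0.5].index  # columns above the cutoff
df_reduced = df.drop(columns=to_drop)
print(df_reduced.columns.tolist())
```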
opt.paging=False
display(missing_spark_df)
| category | total | percentage |
|---|---|---|
Based on the results of the null values DataFrame generated with Spark, it appears that several columns have a significant number of missing values. The 'director', 'cast', 'country', and 'date_added' columns are missing values at a rate of more than 7%, and the 'rating' column also has a small number of missing values.
This information can be used to make decisions about how to handle these missing values. There are several options for dealing with missing data, including dropping the columns, imputing the missing values, or using some other strategy.
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=False)
import plotly.express as px
scale = 3/4
reds = ['salmon','darksalmon','red','crimson','darkred']
reds2 = ['red', 'crimson', 'darkred']
item1 = px.bar(missing_spark_df, y='category', x='total', orientation='h',color= 'category', color_discrete_sequence=['red'], template='plotly_dark', height=625)
def update_style(item, title, size, h, w):
    fig = item.update_layout(
        height=h, width=w,
        title={'text': title, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top',
               'font': {'size': size, 'color': 'red', 'family': 'Balto'}},
        xaxis_showgrid=False, yaxis_showgrid=False,
        # plot_bgcolor='rgba(0, 0, 0, 0)', paper_bgcolor='rgba(0, 0, 0, 0)',
        template='plotly_dark')
    return fig
item1= update_style(item1, 'Spark null values', 40*scale, 600*scale, 1200*scale)
item1.show()
item2 = px.pie(missing_spark_df, values='total', names='category', hole=0.4, template='plotly_dark', color_discrete_sequence=reds[::-1])
item2 = update_style(item2, 'Total Null values percentage',40*scale, 600*scale, 1200*scale)
item2.show()
The graph shows the total number of missing values in each column, divided into categories. The categories are: 'director', 'date_added', 'cast', 'country', 'rating', 'release_year', 'listed_in', 'show_id', 'title', 'duration', and 'type'. The bar chart displays the total number of missing values in each category, and the pie chart displays the percentage from the total of missing values in each category. The results show that the column with the most missing values is 'director' with 1971 missing values, followed by 'date_added' with 661 missing values. The least number of missing values is in the 'show_id' and 'title' columns with only 2 missing values each. The pie chart displays that the 'director' column has the largest percentage of missing values at 53.1% from the total.
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from wordcloud import WordCloud
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
spark_words = [(w, f) for w,f in zip(missing_spark_df.category, missing_spark_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
background_color ='black',
mode = "RGBA",
colormap = cmap,
min_font_size = 9,
relative_scaling = 0.01,
#stopwords = STOPWORDS,
)
wordcloud.generate_from_frequencies(dict(spark_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()
When working with pandas, we can use the isna() method, which checks for missing values in a DataFrame or Series. It returns a Boolean mask indicating whether each element is missing (True) or not (False). Combined with the sum() method, it returns the number of missing values in each column.
bool_na = pandas_data.isna().sum() > 0
total = pandas_data.isna().sum()[bool_na]
missing_pandas_df= pd.DataFrame()
missing_pandas_df['category'] = total.index
missing_pandas_df['total'] = total.values
missing_pandas_df['percentage'] = total.values/ pandas_data.shape[0]
missing_pandas_df = missing_pandas_df.round(4)
Creating a "tidy" table is an important step when working with data in order to have greater efficiency when generating interactive charts.
display(missing_pandas_df)
| category | total | percentage |
|---|---|---|
From these results, we can conclude that the columns 'director', 'cast' and 'country' have the highest number of missing values, while 'date_added' and 'rating' have the lowest. Overall, the dataset has a moderate amount of missing values.
item3 = px.bar(missing_pandas_df, y='category', x='total', orientation='h', color= 'category', color_discrete_sequence=['red'], template='plotly_dark')
item3 = update_style(item3, 'Pandas null values', 40*scale, 600*scale, 1200*scale)
item3.show()
item4 = px.pie(missing_pandas_df, values='total', names='category', hole=.4, opacity=.7, color_discrete_sequence= reds[::-1], template='plotly_dark')
item4 = update_style(item4, 'Total Null values percentage', 40*scale, 600*scale, 1200*scale)
item4.show()
The comparison of the total missing values between Spark and Pandas shows that Pandas has fewer missing values in the columns "director", "cast", "country", and "rating". The "director" column has 1969 missing values in Pandas versus 1971 in Spark; "cast" has 570 versus 571; and "country" has 476 versus 478. The most important result is the "date_added" column, with 11 missing values in Pandas versus 661 in Spark.
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
pandas_words = [(w, f) for w,f in zip(missing_pandas_df.category, missing_pandas_df.total)]
wordcloud = WordCloud(width = 1440, height = 190,
background_color ='black',
mode = "RGBA",
colormap = cmap,
min_font_size = 1,
relative_scaling = 0.03,
#stopwords = STOPWORDS,
)
wordcloud.generate_from_frequencies(dict(pandas_words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()
The merge function in pandas can be used to compare and combine the missing values identified by Spark and pandas. The function links rows of two dataframes based on one or more common columns, known as the "key" columns. It is possible to specify the type of join to be performed using the "how" parameter, which can take values such as "left", "right", "outer", or "inner".
Additionally, the merge function can handle duplicate column names by appending a suffix to the column names of one of the dataframes. This can be done using the "suffixes" parameter, which takes a tuple of two strings to be appended to the column names of the left and right dataframes, respectively.
missing_spark_pandas = missing_spark_df.merge(missing_pandas_df, on='category', how='outer', suffixes=('_spark', '_pandas')).fillna(0)
missing_spark_pandas['difference'] = missing_spark_pandas.total_spark - missing_spark_pandas.total_pandas
missing_spark_pandas['total_cum_spark'] = missing_spark_pandas.total_spark.cumsum()
missing_spark_pandas['total_cum_pandas'] = missing_spark_pandas.total_pandas.cumsum()
missing_spark_pandas['diff_cum'] = missing_spark_pandas.difference.cumsum()
display(missing_spark_pandas)
| category | total_spark | percentage_spark | total_pandas | percentage_pandas | difference | total_cum_spark | total_cum_pandas | diff_cum |
|---|---|---|---|---|---|---|---|---|
For each category in the dataset, the number of missing values and the percentage of missing values are provided for both Spark and Pandas. The difference between the two is also calculated. The cumulative total and cumulative difference between Spark and Pandas is also calculated.
Based on the results, it appears that there are differences in the number of missing values found by Spark and Pandas for several categories. For example, for the "director" category, Spark found 2 more missing values compared to Pandas. For the "date_added" category, Pandas found 650 fewer missing values than Spark. These differences may indicate that the two libraries handle missing data differently.
So, we are going to work with the data imported using pandas.
The reasons are:
Advantages:
Exceptions:
Plotly templates are a great way to customize the look and feel of your plots. They allow you to easily change the colors, fonts, and other visual elements of your plots without having to write any code. This can be useful when you want to quickly create a plot that looks good and is easy to read.
import plotly.io as pio
pio.templates
Templates configuration
-----------------------
Default template: 'plotly'
Available templates:
['ggplot2', 'seaborn', 'simple_white', 'plotly',
'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
'ygridoff', 'gridon', 'none']
item5 = px.bar(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas', 'difference'], orientation='v', barmode='group', color_discrete_sequence=reds2, template='plotly_dark')
item5 = update_style(item5, 'Total Null Values Comparison', 40*scale, 600*scale, 1200*scale)
item5.show()
item6 = px.line(missing_spark_pandas, x='category', y=['total_spark', 'total_pandas','difference'], color_discrete_sequence=reds2, template='presentation')
item6 = update_style(item6, 'Total Null Values Comparison', 40*scale, 600*scale, 1200*scale)
item6.show()
From the bar and line graphs that were generated, it can be seen that the number of missing values for each column in the imported data sets of Spark and Pandas were compared. The difference in missing values for the "date_added" column is evident, with a difference of 650. The graph highlights that in this case, Pandas was better optimized in processing the date format compared to Spark, which had difficulty correctly processing the date format. This can be concluded by comparing the missing value count of "date_added" in Spark (661) to that in Pandas (11).
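A small hedged demonstration of the underlying issue: when the format string matches, dates like 'September 9, 2019' parse cleanly, while a mismatched format turns every value into a null (NaT), which is analogous to Spark nulling dates under an incorrect dateFormat:

```python
import pandas as pd

raw = pd.Series(["September 9, 2019", "November 20, 2019"])

# Matching format: both values parse to proper dates
parsed = pd.to_datetime(raw, format="%B %d, %Y")

# Mismatched format: every value is coerced to NaT (missing)
bad = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.tolist(), int(bad.isna().sum()))
```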
item7 = px.line(missing_spark_pandas, x='category', y=['total_cum_spark', 'total_cum_pandas'], color_discrete_sequence=reds2, template='presentation')
item7 = update_style(item7, 'Total Cumulative Null Values Comparison', 40*scale, 600*scale, 1400*scale)
item7.update_layout(template='presentation')
item7.show()
In the line graph, it is evident that the data imported through Spark has a higher cumulative total of 3713 null values, compared to the data imported through Pandas with a cumulative total of 3036 null values. This difference can be attributed to the significant number of null values present in the data imported through Spark. The graph clearly demonstrates this disparity between the two datasets.
word= list(missing_spark_df.category.values)+list(missing_pandas_df.category.values)
freq= list(missing_spark_df.total.values)+list(missing_pandas_df.total.values)
cmap = LinearSegmentedColormap.from_list("", ['#f00','#7a1b0c'])
words = [(w, f) for w,f in zip(word, freq)]
wordcloud = WordCloud(width = 1440, height = 249,
background_color ='black',
mode = "RGBA",
colormap = cmap,
min_font_size = 1,
relative_scaling = 0.03,
#stopwords = STOPWORDS,
)
wordcloud.generate_from_frequencies(dict(words))
# show words cloud
plt.figure(figsize = (10, 10), dpi= 300, facecolor = None)
plt.imshow(wordcloud)
plt.margins(x=0, y=0)
plt.axis("off")
plt.show()
In many cases, data is collected and used solely for informational purposes, such as creating dashboards or visualizations. In such scenarios, it is not necessary to perform complex imputation methods, such as machine learning, to fill in missing values. Instead, a simple approach, such as removing rows with missing values or replacing them with a central tendency measure, can be sufficient.
This approach is known as simple imputation and is suitable for datasets where the goal is to provide a general overview rather than making predictions or performing in-depth analysis. In the case of Netflix movies, for example, missing values may be present in the data for various reasons, such as data entry errors or missing information from the source.
By removing rows with missing values or replacing them with a central tendency measure, such as the mean or median, we can still provide a general overview of the data and display it in a meaningful way. This approach is also quick and easy to implement, making it a practical solution for data that is used solely for informational purposes.
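The two simple strategies just described can be sketched on a toy frame (hypothetical values; mode for the categorical column, median for the numeric one):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Netflix data (hypothetical values).
df = pd.DataFrame({
    'rating': ['PG', None, 'R', 'PG', None],
    'release_year': [2019, 2020, np.nan, 2021, 2020],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill with a central-tendency measure suited to each
# column's type: mode for categorical, median for numeric.
filled = df.copy()
filled['rating'] = filled['rating'].fillna(filled['rating'].mode()[0])
filled['release_year'] = filled['release_year'].fillna(filled['release_year'].median())

print(len(dropped))                     # rows surviving the drop
print(int(filled.isna().sum().sum()))   # remaining nulls after fill
```

Option 1 shrinks the dataset (here from five rows to two), while option 2 keeps every row at the cost of substituting typical values, which is usually acceptable for dashboard-style overviews.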
data = pandas_data.copy()
item8 = px.imshow(data.isna(), color_continuous_scale=['red', 'black'],
template='plotly_dark', range_color=[0,1],
labels=dict(x="Columns", y="No. Record", color="Null Values"))
# Add color bar legend in range (0,1)
item8.update_layout(coloraxis_colorbar=dict(
title="",
tickvals=[0, 1],
ticktext=["Not Nulls", "Nulls"],
ticks="inside",
thickness=15,
tickfont=dict(size=15),
outlinecolor="lightgray",
outlinewidth=1
))
item8.update_layout(paper_bgcolor='rgba(0, 0, 0, 1)')
item8 = update_style(item8, 'Null Values Matrix', 40*scale,450, 1440*scale)
item8.show()
opt.paging = False
display(missing_pandas_df)
| category | total | percentage |
|---|---|---|
missing_types= data[missing_pandas_df.category].dtypes.reset_index()
missing_types.columns = ['category', 'data_type']
display(missing_types)
| category | data_type |
|---|---|
opt.paging= True
display(data[data.director.isna()])
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
data['director'] = data['director'].fillna('no director')
display(data.director)
| director |
|---|
The 'director' column was identified as having null values. To address this, a simple and straightforward solution was employed: the missing values were replaced with the string 'no director'.
display(data[data.cast.isna()])
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
data['cast'] = data['cast'].fillna('no cast')
data.cast
| cast |
|---|
The 'cast' column was identified as having null values. To address this, the same straightforward solution was employed: the missing values were replaced with the string 'no cast'.
display(data[data.country.isna()])
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
data['country'] = data['country'].fillna('no country')
data.country
| country |
|---|
The 'country' column was identified as having null values. To address this, the same straightforward solution was employed: the missing values were replaced with the string 'no country'.
display(data[data.date_added.isna()])
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
In the case of columns 'date_added' and 'rating', the null values represent a very small percentage of the total data, around 0.2%. Therefore, the most straightforward solution is to simply drop these rows and clean the data, rather than imputing or replacing these values with any other information. This approach reduces the risk of introducing inaccuracies or biases in the data, and helps to maintain the integrity of the information being presented in the dashboard. By removing the null values in these columns, the data will be more consistent and provide a clearer representation of the available information.
data.dropna(subset=['date_added', 'rating'], inplace=True)
data[['date_added','rating']]
| date_added | rating |
|---|---|
count_missing = data.isna().sum().reset_index()
count_missing.columns = ['category', 'null values sum']
display(count_missing)
| category | null values sum |
|---|---|
item9 = px.imshow(data.isna(), color_continuous_scale=['red', 'black'],
template='plotly_dark', range_color=[0,1],
labels=dict(x="Columns", y="No. Record", color="Null Values"))
# Add color bar legend in range (0,1)
item9.update_layout(coloraxis_colorbar=dict(
title="",
tickvals=[0, 1],
ticktext=["Not Nulls", "Nulls"],
ticks="inside",
thickness=15,
tickfont=dict(size=15),
outlinecolor="lightgray",
outlinewidth=1
))
item9.update_layout(paper_bgcolor='rgba(0, 0, 0, 1)')
item9 = update_style(item9, 'Null Values Matrix', 40*scale,450, 1440*scale)
item9.show()
data.to_csv('netflix_titles_clean.csv', index=False)
The visualization confirms that the dataset no longer contains null values and is almost ready for building the dashboard. In this matrix, red marks non-null cells and black would mark nulls (as the colorbar legend indicates); since the matrix now renders uniformly red, no missing values remain.
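Before exporting, a quick sanity check along these lines could assert that no nulls remain. This is a sketch on a toy frame mirroring the cleaned data's final state; in the notebook itself, `data.isna().sum().sum()` would be checked directly.

```python
import pandas as pd

# Toy frame mirroring the cleaned data: placeholder strings fill
# the categorical gaps, and rows with remaining nulls were dropped.
clean = pd.DataFrame({
    'director': ['no director', 'Jane Doe'],
    'cast': ['no cast', 'John Smith'],
    'date_added': ['2020-01-01', '2021-06-15'],
    'rating': ['PG', 'R'],
})

remaining = int(clean.isna().sum().sum())
print(remaining)  # 0
```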
!jupyter nbconvert --to html_toc --theme jupyterlab_miami_nights --output Netflix_Null_Values.html Netflix_Null_Values.ipynb
[NbConvertApp] Converting notebook Netflix_Null_Values.ipynb to html [NbConvertApp] Writing 13047719 bytes to Netflix_Null_Values.html